Analyzing Singapore's Resident Population

Dataset source : https://data.gov.sg/dataset/resident-population-by-ethnicity-gender-and-age-group Dataset contains data of the number of residents in Singapore from 1957-2018 split by age and ethnicity

Importing the dataset

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("sng.csv")
In [3]:
df.head()
Out[3]:
year level_1 level_2 value
0 1957 Total Residents 0 - 4 Years 264727
1 1957 Total Residents 5 - 9 Years 218097
2 1957 Total Residents 10 - 14 Years 136280
3 1957 Total Residents 15 - 19 Years 135679
4 1957 Total Residents 20 - 24 Years 119266

The dataset contains number of residents split by ethnicity, gender and age. Age is in 5 year brackets.

Basic preprocessing

In [4]:
df.dtypes
Out[4]:
year        int64
level_1    object
level_2    object
value      object
dtype: object
In [5]:
df["value"] = pd.to_numeric(df["value"],errors="coerce")
In [6]:
df.dtypes
Out[6]:
year         int64
level_1     object
level_2     object
value      float64
dtype: object
In [7]:
df.isnull().sum().sum()
Out[7]:
2460

There are 2460 null values

In [8]:
df_nan = df[df.isna().any(axis=1)]
df_nan["level_2"].unique()
Out[8]:
array(['70 - 74 Years', '75 - 79 Years', '80 - 84 Years',
       '75 Years & Over', '80 Years & Over', '85 Years & Over'],
      dtype=object)

Since all the null values occur above the age of 70, it is safe to assume that null value means the value is 0. This is based on the assumption that if the null values occured due to errors in reporting or data entry they would be distributed randomly.

In [9]:
df = df.fillna(0)
In [10]:
df.iloc[344]
Out[10]:
year                  1958
level_1    Total Residents
level_2      70 - 74 Years
value                    0
Name: 344, dtype: object

Replaced null values with 0.
Identifying the different ethnic groups:

In [11]:
df["level_1"].unique()
Out[11]:
array(['Total Residents', 'Total Male Residents',
       'Total Female Residents', 'Total Malays', 'Total Male Malays',
       'Total Female Malays', 'Total Chinese', 'Total Male Chinese',
       'Total Female Chinese', 'Total Indians', 'Total Male Indians',
       'Total Female Indians', 'Other Ethnic Groups (Total)',
       'Other Ethnic Groups (Males)', 'Other Ethnic Groups (Females)'],
      dtype=object)

Identifying the largest ethnic group

Isolating the entries of the year 2018(last year data was recorded)

In [12]:
df_2018 = df[df["year"]==2018]
In [13]:
df_2018.head()
Out[13]:
year level_1 level_2 value
20130 2018 Total Residents 0 - 4 Years 185528.0
20131 2018 Total Residents 5 - 9 Years 199066.0
20132 2018 Total Residents 10 - 14 Years 206530.0
20133 2018 Total Residents 15 - 19 Years 226520.0
20134 2018 Total Residents 20 - 24 Years 255516.0

Only considering total number of residents of each ethnic group

In [14]:
df_2018_total = df_2018[df_2018["level_1"].isin(["Total Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])].reset_index(drop=True)
In [15]:
df_2018_total_grp = df_2018_total.groupby(["year","level_1"],as_index=False)["value"].sum()
In [16]:
df_2018_sorted = df_2018_total_grp.sort_values("value")
In [17]:
df_2018_sorted.reset_index(drop=True, inplace=True)
In [18]:
df_2018_sorted
Out[18]:
year level_1 value
0 2018 Other Ethnic Groups (Total) 144270.0
1 2018 Total Indians 428314.0
2 2018 Total Malays 636807.0
3 2018 Total Chinese 3975586.0
4 2018 Total Residents 5184977.0

The largest ethnic group in Singapore by total population (in 2018) is Chinese

In [19]:
percentage_chinese = (df_2018_sorted.iloc[3,2]/df_2018_sorted.iloc[4,2])*100
In [20]:
percentage_chinese
Out[20]:
76.67509421931862
In [23]:
import plotly.offline as pyo
pyo.init_notebook_mode()
In [24]:
import plotly.express as px
df_2018_sorted_mod = df_2018_sorted[df_2018_sorted.index != 4]
fig = px.pie(df_2018_sorted_mod, values='value', names='level_1', title='Population distribution of Singapore in 2018 by ethnicity')
fig.show()

In 2018, Chinese residents made up 76.67% of Singapore's total resident population

Analyzing the growth of the chinese population in Singapore

In [25]:
df_chinese = df[df['level_1'].isin(["Total Chinese","Total Male Chinese", "Total Female Chinese"])].reset_index(drop=True)
df_chinese_grp = df_chinese.groupby(["year","level_1"],as_index=False)["value"].sum()
df_chinese_grp.head()
Out[25]:
year level_1 value
0 1957 Total Chinese 1137408.0
1 1957 Total Female Chinese 565609.0
2 1957 Total Male Chinese 571799.0
3 1958 Total Chinese 1178100.0
4 1958 Total Female Chinese 582700.0
In [26]:
df_chinese = df_chinese_grp.pivot_table(values="value",index="year",columns="level_1")
df_chinese_mod = df_chinese.reset_index( drop=False, inplace=False )
df_chinese_mod.columns.names = [""]
In [27]:
df_chinese_mod.head()
Out[27]:
year Total Chinese Total Female Chinese Total Male Chinese
0 1957 1137408.0 565609.0 571799.0
1 1958 1178100.0 582700.0 595400.0
2 1959 1235500.0 611300.0 624200.0
3 1960 1284000.0 635900.0 648100.0
4 1961 1328500.0 658400.0 670100.0

population growth of a year = population in current year - population in previous year

In [28]:
growth_df = df_chinese.diff(axis=0)
growth_df.reset_index( drop=False, inplace=True )
growth_df.columns.names = [""]
In [29]:
growth_df = growth_df[growth_df.year != 1957]
growth_df.head()
Out[29]:
year Total Chinese Total Female Chinese Total Male Chinese
1 1958 40692.0 17091.0 23601.0
2 1959 57400.0 28600.0 28800.0
3 1960 48500.0 24600.0 23900.0
4 1961 44500.0 22500.0 22000.0
5 1962 39700.0 20500.0 19200.0

This dataframe shows population growth each year

In [30]:
avg_m_chinese = growth_df['Total Male Chinese'].mean()
avg_f_chinese = growth_df['Total Female Chinese'].mean()
avg_chinese = growth_df['Total Chinese'].mean()
print(avg_m_chinese, avg_f_chinese, avg_chinese)
21466.147540983606 25061.360655737706 46527.50819672131

The average year-over-year population growth for chinese male residents was 21466 residents per year
The average year-over-year population growth for chinese female residents was 25061 residents per year
The average year-over-year population growth for total chinese residents was 46527 residents per year

Plotting data of Chinese residents in Singapore

In [31]:
import plotly.graph_objs as go
trace1 = go.Scatter(
                    x = df_chinese_mod.year,
                    y = df_chinese_mod['Total Chinese'],
                    mode = "lines",
                    name = "Chinese population",
                    )
trace2 = go.Scatter(
                    x = df_chinese_mod.year,
                    y = df_chinese_mod['Total Male Chinese'],
                    mode = "lines",
                    name = "Male Chinese population",
                    )
trace3 = go.Scatter(
                    x = df_chinese_mod.year,
                    y = df_chinese_mod['Total Female Chinese'],
                    mode = "lines",
                    name = "Female Chinese population",
                    )
data = [trace1,trace2,trace3]
layout = dict(title = 'Chinese population in Singapore',
              xaxis= dict(title= 'Year'),
              yaxis=dict(title= 'Number of Chinese residents')
             )
fig = go.Figure(dict(data = data, layout = layout))
fig.show()

From this graph we can see that chinese population in Singapore has consistently increased since 1957, with the male and female populations exactly equal since 1980. Since then Chinese Female residents are higher in number.

In [32]:
import plotly.graph_objs as go
trace1 = go.Scatter(
                    x = growth_df.year,
                    y = growth_df['Total Chinese'],
                    mode = "lines",
                    name = "Chinese population growth",
                    )
trace2 = go.Scatter(
                    x = growth_df.year,
                    y = growth_df['Total Male Chinese'],
                    mode = "lines",
                    name = "Male Chinese population growth",
                    )
trace3 = go.Scatter(
                    x = growth_df.year,
                    y = growth_df['Total Female Chinese'],
                    mode = "lines",
                    name = "Female Chinese population growth",
                    )
data = [trace1,trace2,trace3]
layout = dict(title = 'Population growth per year',
              xaxis= dict(title= 'Year'),
              yaxis=dict(title= 'Population growth per year')
             )
fig = go.Figure(dict(data = data, layout = layout))
fig.show()

Since the mid 1980s the female chinese population growth is slightly more than the male chinese population growth. There was a huge spike in chinese residents in 1990, probably due to an external policy change. The chinese population in Singapore has grown almost every year since 1958 and declined only in 2003.

Identifying the largest age bracket

In [33]:
df_2018_age_grp = df_2018.groupby(["year","level_2"],as_index=False)["value"].sum()
In [34]:
df_2018_age_grp.head()
Out[34]:
year level_2 value
0 2018 0 - 4 Years 742112.0
1 2018 10 - 14 Years 826120.0
2 2018 15 - 19 Years 906080.0
3 2018 20 - 24 Years 1022064.0
4 2018 25 - 29 Years 1166524.0
In [35]:
df_2018_age_grp.rename({"level_2":"age_bracket"},axis=1, inplace=True)
df_2018_age_grp["age_bracket"].unique()
Out[35]:
array(['0  -  4 Years', '10 - 14 Years', '15 - 19 Years', '20 - 24 Years',
       '25 - 29 Years', '30 - 34 Years', '35 - 39 Years', '40 - 44 Years',
       '45 - 49 Years', '5  -  9 Years', '50 - 54 Years', '55 - 59 Years',
       '60 - 64 Years', '65 - 69 Years', '65 Years & Over',
       '70 - 74 Years', '70 Years & Over', '75 - 79 Years',
       '75 Years & Over', '80 - 84 Years', '80 Years & Over',
       '85 Years & Over'], dtype=object)

Removing data stored in the format "x Years & Over". This is because the same data is already stored in the corresponding age bracket, leading to doublecounting the data if not removed.

In [36]:
df_2018_age_grp = df_2018_age_grp[~df_2018_age_grp["age_bracket"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
In [37]:
df_2018_age_sorted = df_2018_age_grp.sort_values("value",ascending=False)
In [38]:
df_2018_age_sorted.reset_index(drop=True, inplace=True)
df_2018_age_sorted.head()
Out[38]:
year age_bracket value
0 2018 50 - 54 Years 1235860.0
1 2018 45 - 49 Years 1231352.0
2 2018 55 - 59 Years 1217560.0
3 2018 35 - 39 Years 1214880.0
4 2018 40 - 44 Years 1214560.0

50-54 years old is the most common age bracket for residents in Singapore

In [39]:
fig = px.pie(df_2018_age_sorted, values='value', names='age_bracket', title='Population distribution of Singapore in 2018 (by age)')
fig.show()

50-54 years old residents comprise of 7.83% of the population.

Most residents in Singapore are middle aged, with the population split fairly evenly between the age of 30 and 60

Average population growth of the 50-54 age bracket

In [40]:
df_age = df[df['level_2']=="50 - 54 Years"].reset_index(drop=True)
df_age = df_age.groupby(["year","level_2"],as_index=False)["value"].sum()
df_age.drop("level_2",axis=1,inplace=True)
df_age.head()
Out[40]:
year value
0 1957 207176.0
1 1958 216800.0
2 1959 229200.0
3 1960 238400.0
4 1961 250400.0
In [41]:
diff_age = df_age["value"].diff()
age_growth_df = pd.DataFrame()
age_growth_df["year"] = df_age["year"]
age_growth_df["growth"] = diff_age
age_growth_df = age_growth_df[age_growth_df.year != 1957]
age_growth_df.head()
Out[41]:
year growth
1 1958 9624.0
2 1959 12400.0
3 1960 9200.0
4 1961 12000.0
5 1962 12400.0
In [42]:
avg_growth = age_growth_df["growth"].mean()
print(avg_growth)
16863.67213114754

The number of residents in the age bracket of 50-54 has an average increase of 16864 poeple a year

Plotting data of 50-54 year old residents in Singapore

In [43]:
import plotly.express as px
fig = px.line(age_growth_df, x='year', y='growth', title="Population growth vs Time in years of 50-54 year old  Singaporean residents")
fig.show()

The population growth seems to be pretty erratic with several peaks and dips. There is a prominent spike in the year 2000. The population of 50-54 year old residents have been declining since 2016. This has happenned twice in the past - at 1995 and at 1970

In [44]:
import plotly.express as px
fig = px.line(df_age, x='year', y='value', title="Population vs Time in years of 50-54 year old Singaporean residents")
fig.show()

The general trend is an increase in the population of 50-54 year olds but the trend seems to have reversed in recent years.

Q3) Group with highest average growth per year

In order to find group with highest growth rate we need to compare the data from 1958 and 2018

In [66]:
df_1958 = df[df["year"]==1958].reset_index(drop=True)
In [67]:
df_1958.head()
Out[67]:
year level_1 level_2 value
0 1958 Total Residents 0 - 4 Years 277800.0
1 1958 Total Residents 5 - 9 Years 229600.0
2 1958 Total Residents 10 - 14 Years 157100.0
3 1958 Total Residents 15 - 19 Years 134300.0
4 1958 Total Residents 20 - 24 Years 124900.0
In [68]:
df_2018.reset_index(drop=True, inplace=True)
df_2018.head()
Out[68]:
year level_1 level_2 value
0 2018 Total Residents 0 - 4 Years 185528.0
1 2018 Total Residents 5 - 9 Years 199066.0
2 2018 Total Residents 10 - 14 Years 206530.0
3 2018 Total Residents 15 - 19 Years 226520.0
4 2018 Total Residents 20 - 24 Years 255516.0
In [69]:
pop_change_df = pd.DataFrame()
pop_change_df[["type","age","1958"]] = df_1958[["level_1","level_2","value"]]
pop_change_df["2018"] = df_2018["value"]

pop_change_df["change"] = df_2018["value"]-df_1958["value"]
In [70]:
pop_change_df.head()
Out[70]:
type age 1958 2018 change
0 Total Residents 0 - 4 Years 277800.0 185528.0 -92272.0
1 Total Residents 5 - 9 Years 229600.0 199066.0 -30534.0
2 Total Residents 10 - 14 Years 157100.0 206530.0 49430.0
3 Total Residents 15 - 19 Years 134300.0 226520.0 92220.0
4 Total Residents 20 - 24 Years 124900.0 255516.0 130616.0

Average growth is (population change)/number of years,

In [71]:
pop_change_df["growth_rate"] = pop_change_df["change"]/60
In [72]:
pop_change_df.head()
Out[72]:
type age 1958 2018 change growth_rate
0 Total Residents 0 - 4 Years 277800.0 185528.0 -92272.0 -1537.866667
1 Total Residents 5 - 9 Years 229600.0 199066.0 -30534.0 -508.900000
2 Total Residents 10 - 14 Years 157100.0 206530.0 49430.0 823.833333
3 Total Residents 15 - 19 Years 134300.0 226520.0 92220.0 1537.000000
4 Total Residents 20 - 24 Years 124900.0 255516.0 130616.0 2176.933333
In [73]:
print(pop_change_df["growth_rate"].max())
print(pop_change_df["growth_rate"].idxmax())
8569.233333333334
17
In [74]:
pop_change_df.iloc[17]
Out[74]:
type           Total Residents
age            65 Years & Over
1958                     33700
2018                    547854
change                  514154
growth_rate            8569.23
Name: 17, dtype: object

The 65 Years and Over sector has seen the highest growth rate from 1958 to 2018

Removing aggregate categories,

In [75]:
pop_change_df = pop_change_df[~pop_change_df["age"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
In [76]:
pop_change_df = pop_change_df[~pop_change_df["type"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
In [77]:
pop_change_df.head()
Out[77]:
type age 1958 2018 change growth_rate
88 Total Male Malays 0 - 4 Years 22700.0 18624.0 -4076.0 -67.933333
89 Total Male Malays 5 - 9 Years 15900.0 15878.0 -22.0 -0.366667
90 Total Male Malays 10 - 14 Years 9900.0 17117.0 7217.0 120.283333
91 Total Male Malays 15 - 19 Years 7800.0 20922.0 13122.0 218.700000
92 Total Male Malays 20 - 24 Years 10900.0 23682.0 12782.0 213.033333
In [78]:
pop_change_df.sort_values("growth_rate", ascending=False).head(3).reset_index(drop=True)
Out[78]:
type age 1958 2018 change growth_rate
0 Total Female Chinese 55 - 59 Years 15800.0 116668.0 100868.0 1681.133333
1 Total Male Chinese 55 - 59 Years 17000.0 115822.0 98822.0 1647.033333
2 Total Female Chinese 60 - 64 Years 11300.0 109436.0 98136.0 1635.600000

The sector that saw the highest growth rate was 55-59 year old Chinese women with an average increase of 1681 women per year

In [59]:
pop_change_df[(pop_change_df['growth_rate'] >= -0.5) & (pop_change_df['growth_rate'] <= 0.5)]
Out[59]:
type age 1958 2018 change growth_rate
89 Total Male Malays 5 - 9 Years 15900.0 15878.0 -22.0 -0.366667

Only Malay males between ages of 5-9 years has a rate of population change between -0.5 and 0.5 ie population remained relatively unchanged

In [79]:
df_trends = df[df["level_1"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
In [80]:
df_trends = df_trends.rename(columns = {"value":"total_population"})
In [81]:
df_trends["total_population"].describe()
Out[81]:
count      9548.000000
mean      60381.899979
std       75772.417615
min           0.000000
25%        3425.750000
50%       26636.000000
75%      102016.500000
max      547854.000000
Name: total_population, dtype: float64
In [82]:
df_trends = df_trends.groupby(["year","level_1"],as_index=False)["total_population"].sum()
In [83]:
df_trends = df_trends.rename(columns={"level_1":"ethnicity"})
In [84]:
df_trends.head()
Out[84]:
year ethnicity total_population
0 1957 Other Ethnic Groups (Total) 35976.0
1 1957 Total Chinese 1137408.0
2 1957 Total Female Residents 718610.0
3 1957 Total Indians 125912.0
4 1957 Total Malays 202907.0
In [85]:
fig = px.scatter(df_trends, x="year", y="total_population", color="ethnicity", trendline="ols")
fig.show()

The fastest growing populations in descending order are: Chinese, Malay, Indians, others

Age distribution over time

Removing aggregates

In [86]:
df_age = df[~df["level_1"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
In [87]:
df_age = df_age[~df_age["level_2"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
In [88]:
df_age = df_age.rename(columns={"level_1":"ethnicity", "level_2":"age", "value":"population"})
In [89]:
fig = px.bar(df_age, x="age", y="population", color="ethnicity",
  animation_frame="year",  range_y=[0,1000000])
fig.show()

It can be observed how Singapore's population was initially skewed to younger ages. However, as time passed and Singapore went from a third-world country to a first-world country, the middle aged population exploded. This can be attributed to the large number of foreigners who migrated to Singapore in search of a better quality of life and financial opportunities.

In [ ]: